CROSS-ENTROPY MINIMIZATION GIVEN FULLY-DECOMPOSABLE
SUBSET AND AGGREGATE CONSTRAINTS

I.  INTRODUCTION 

The principle of minimum cross-entropy provides a general method of
inference about an unknown probability distribution
$q^\dagger = (q_1^\dagger, q_2^\dagger, \ldots, q_N^\dagger)$
when there exists a prior estimate of $q^\dagger$ as well as new information
in the form of known expected values. The principle states that, of all the
densities with the correct expected values, one should choose the posterior
$q$ with the least cross-entropy $H[q,p] = \sum_i q_i \log(q_i/p_i)$, where
$p$ is a prior estimate of $q^\dagger$. Cross-entropy minimization was first
introduced by Kullback [1], who called it minimum directed divergence and
minimum discrimination information. The principle of maximum entropy [2], [3]
is equivalent to cross-entropy minimization in the special case of uniform
priors. Cross-entropy minimization has a long history of applications in a
variety of fields (for a list of references, see [4]). Recently, results have
been obtained for spectral analysis [5], speech coding [6], pattern
recognition [7], queuing theory [8], [9], and computer system modeling
[10], [11]. For discussions of the background, validity, and properties of
cross-entropy minimization, see [1], [4], [12], [13].

Manuscript submitted November 18, 1980.

There are a number of general algorithms for finding minimum cross-entropy
distributions given arbitrary priors and arbitrary constraints [14]-[16].
Most of these algorithms are based on the Newton-Raphson method [13,
Appendix A], which involves a matrix inversion during each iteration, and the
computation time for all of them grows rapidly with the number of points N.
Such rapid growth may be unavoidable in the case of completely general
expectation functions. In this paper, we consider a less general case: one in
which the known expected values are either of the forms

$$ \sum_{i \in D_j} f_i \, q_i^\dagger \tag{1} $$

or

$$ \sum_{j=1}^{M} a_j \, r_j^\dagger , \tag{2} $$

where the $D_j$, $j = 1,2,\ldots,M$, are disjoint subsets
$D_j \subset \{1,2,\ldots,N\}$, with
$D_1 \cup D_2 \cup \cdots \cup D_M = \{1,2,\ldots,N\}$, and where the
$r_j^\dagger$ are given by

$$ r_j^\dagger = \sum_{i \in D_j} q_i^\dagger . \tag{3} $$


That is, we consider the case in which one knows expected values conditional
on any of the disjoint subsets $D_j$ as well as expected values of the
distribution of aggregate probabilities $r_j^\dagger$. In this case, we shall
show that some special properties apply and that one can solve the overall
minimum cross-entropy problem by solving a minimum cross-entropy problem for
each of the conditional subset distributions followed by one minimum
cross-entropy problem for the aggregate distribution. If all of the subsets
have equal sizes, this means that, instead of solving one N-dimensional
problem, one can solve M problems of dimension N/M followed by one problem of
dimension M.

The results presented in this paper may be particularly useful in
applications that concern queuing networks and other computer system
performance models. In these applications, known expected values often arise
from rate balance equations. Consider a simple example: For an M/M/1/N queue
with state probabilities $q_i^\dagger$, a typical rate-balance equation is

$$ (\lambda_i + \mu_i)\, q_i^\dagger = \mu_{i+1}\, q_{i+1}^\dagger + \lambda_{i-1}\, q_{i-1}^\dagger , $$

where $\lambda_i$ and $\mu_i$ respectively are state-dependent arrival and
service rates. This equation is just a special case of the general form
$\sum_i f_i q_i^\dagger$. For general queuing networks, the disjoint subsets
in (1)-(3) can correspond to internal device states and the $r_j^\dagger$ can
correspond to the probabilities of aggregate device states. The results of
this paper apply when one has both rate balance equations for device
equilibrium of the form (1) and rate balance equations for system equilibrium
of the form (2).
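
As a concrete check of the rate-balance example, the following sketch writes
one balance equation as a linear constraint on the state probabilities. It
assumes constant rather than state-dependent rates and uses the textbook
geometric stationary distribution of an M/M/1/N queue; the function names are
illustrative and do not come from the paper.

```python
def mm1n_stationary(lam, mu, n):
    """Stationary distribution of an M/M/1/N queue with constant rates (assumed)."""
    rho = lam / mu
    weights = [rho ** i for i in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def balance_coefficients(lam, mu, n, i):
    """Coefficients f_k such that sum_k f_k * q_k = 0 at an interior state i."""
    f = [0.0] * (n + 1)
    f[i] = lam + mu      # total rate of leaving state i
    f[i + 1] = -mu       # service completions entering from state i + 1
    f[i - 1] = -lam      # arrivals entering from state i - 1
    return f


q = mm1n_stationary(lam=2.0, mu=3.0, n=5)
f = balance_coefficients(lam=2.0, mu=3.0, n=5, i=2)
residual = sum(fk * qk for fk, qk in zip(f, q))
print(abs(residual) < 1e-12)
```

The balance equation is thus an expected-value constraint with a right-hand
side of zero, which is the kind of information the method takes as input.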

Section II defines the notation we shall use and reviews the mathematics
and justification of cross-entropy minimization. In Section III, we summarize
previous results concerning minimum cross-entropy problems with subset and
aggregate constraints, we prove several new properties, and we show how the
minimum cross-entropy problem can be decomposed into smaller problems.
Computational results comparing the full space and decomposed methods are
presented in Section IV.


II.  BACKGROUND 


A. Notation

We  use  the  same  notation  as  in  [4], [13].  For  a  more  detailed  discussion 
of  technical  conditions  and  questions  related  to  the  existence  of  minimum 
cross-entropy  solutions,  see  [12], [13]. 

We use lower-case boldface Roman letters for system states, which may be
multidimensional, and upper-case boldface Roman letters for sets of system
states. We use lower-case Roman letters for probability densities, and
upper-case script letters for sets of probability densities. Thus, let
$\mathbf{x}$ be a state of some system that has a set $\mathbf{D}$ of
possible states. Let $\mathcal{D}$ be the set of all probability densities
$q$ on $\mathbf{D}$ such that $q(\mathbf{x}) \ge 0$ for
$\mathbf{x} \in \mathbf{D}$ and

$$ \int_{\mathbf{D}} d\mathbf{x}\, q(\mathbf{x}) = 1 . \tag{4} $$

We use a dagger $\dagger$ to distinguish the system's unknown "true" state
probability density $q^\dagger \in \mathcal{D}$. When
$\mathbf{S} \subseteq \mathbf{D}$ is some set of states, we write
$q(\mathbf{x} \in \mathbf{S})$ for the set of values $q(\mathbf{x})$ with
$\mathbf{x} \in \mathbf{S}$.

New information takes the form of linear equality constraints

$$ \int_{\mathbf{D}} d\mathbf{x}\, q^\dagger(\mathbf{x}) f_i(\mathbf{x}) = F_i , \qquad (i = 1,2,\ldots,K) \tag{5} $$

for known functions $f_i$ and known values $F_i$. The probability densities
that satisfy such constraints always comprise a convex subset $\mathcal{I}$
of $\mathcal{D}$. We refer to the functions $f_i$ as constraint functions and
to $\mathcal{I}$ as a constraint set. For a given constraint set there may of
course be more than one set of constraint functions in terms of which it may
be defined. We frequently suppress mention of a particular set of constraint
functions, using the notation $I = (q^\dagger \in \mathcal{I})$ to mean that
$q^\dagger$ is a member of the constraint set $\mathcal{I}$ and referring to
$I$ as a constraint or constraints. We use upper-case Roman letters for
constraints. The results in this paper are restricted to the case of equality
constraints. Results for the more general case involving inequality
constraints (bounds on expected values) are discussed in
[1], [4], [12], [13].

Let $p \in \mathcal{D}$ be some prior density that is an estimate of
$q^\dagger$ obtained, by any means, prior to learning $I$. Priors must be
strictly positive: $p(\mathbf{x} \in \mathbf{D}) > 0$ (for discussion, see
[13]). Given a prior $p$ and new information $I$, the posterior density
$q \in \mathcal{I}$ that results from taking $I$ into account is chosen by
minimizing the cross-entropy $H[q,p]$ in the constraint set $\mathcal{I}$:

$$ H[q,p] = \min_{q' \in \mathcal{I}} H[q',p] , \tag{6} $$

where

$$ H[q,p] = \int_{\mathbf{D}} d\mathbf{x}\, q(\mathbf{x}) \log\bigl(q(\mathbf{x})/p(\mathbf{x})\bigr) . \tag{7} $$

We introduce an "information operator" $\circ$ that expresses (6) using the
notation

$$ q = p \circ I . \tag{8} $$

The operator $\circ$ takes two arguments (a prior and new information) and
yields a posterior.

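For some subset $\mathbf{S} \subseteq \mathbf{D}$ of states and
$\mathbf{x} \in \mathbf{S}$, let

$$ q(\mathbf{x} \mid \mathbf{x} \in \mathbf{S}) = q(\mathbf{x}) \Big/ \int_{\mathbf{S}} d\mathbf{x}'\, q(\mathbf{x}') \tag{9} $$

be the conditional density, given $\mathbf{x} \in \mathbf{S}$, corresponding
to any $q \in \mathcal{D}$. We use

$$ q(\mathbf{x} \mid \mathbf{x} \in \mathbf{S}) = q * \mathbf{S} \tag{10} $$

as a shorthand notation for (9).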
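
For a discrete state space, the conditional density in (9) reduces to a
renormalization over the chosen subset. The sketch below illustrates this
under the assumption that states are indexed by integers; the helper name
`conditional` is hypothetical.

```python
def conditional(q, subset):
    """Discrete analogue of (9): renormalize q over the states in `subset`."""
    mass = sum(q[i] for i in subset)
    return {i: q[i] / mass for i in subset}


q = [0.1, 0.2, 0.3, 0.4]
q_given_s = conditional(q, subset=[2, 3])
print(q_given_s)
```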

When $\mathbf{D}$ is a discrete set of system states, densities are replaced
by discrete distributions and integrals by sums in the usual way. We use
lower-case boldface Roman letters for discrete probability distributions,
which we consider to be vectors; for example,
$\mathbf{q} = (q_1, q_2, \ldots, q_N)$. It will always be clear in context
whether, for example, the symbol $\mathbf{r}$ refers to a system state or a
discrete distribution and whether $r_j$ refers to a probability density or a
component of a discrete distribution.


B.  Minimum  Cross-Entropy  Probability  Densities 

Given a positive prior probability density $p$, if there exists a posterior
that minimizes the cross-entropy (7) and satisfies the constraints (4) and
(5), then it has the form

$$ q(\mathbf{x}) = p(\mathbf{x}) \exp\Bigl( -\lambda - \sum_{j=1}^{K} \beta_j f_j(\mathbf{x}) \Bigr) . \tag{11} $$

In (11), $\lambda$ and $\beta_j$ are Lagrangian multipliers whose values are
determined by the constraints (4) and (5). Conversely, if one can find values
for $\lambda$ and $\beta_j$ in (11) such that the constraints (4) and (5) are
satisfied, then the solution exists and is given by (11) [12]. The
cross-entropy at the minimum can be expressed in terms of the Lagrangian
multipliers and the expected values $F_j$ as follows ([1, p. 38], [13]):

$$ H[q,p] = -\lambda - \sum_{j=1}^{K} \beta_j F_j . \tag{12} $$

It is necessary to choose $\lambda$ and the $\beta_j$ so that the constraints
are satisfied. In the presence of the constraint (4), one may rewrite the
remaining constraints (5) in the form

$$ \int_{\mathbf{D}} d\mathbf{x}\, \bigl(f_i(\mathbf{x}) - F_i\bigr)\, q(\mathbf{x}) = 0 . \tag{13} $$

Now, if one finds values for the $\beta_j$ such that

$$ \int_{\mathbf{D}} d\mathbf{x}\, \bigl(f_i(\mathbf{x}) - F_i\bigr)\, p(\mathbf{x}) \exp\Bigl( -\sum_{j=1}^{K} \beta_j f_j(\mathbf{x}) \Bigr) = 0 , \qquad (i = 1,\ldots,K), \tag{14} $$

holds, (13) will be satisfied, and (4) can then be satisfied by setting

$$ \lambda = \log \int_{\mathbf{D}} d\mathbf{x}\, p(\mathbf{x}) \exp\Bigl( -\sum_{j=1}^{K} \beta_j f_j(\mathbf{x}) \Bigr) . \tag{15} $$

If the integral in (15) can be performed, one can sometimes find values for
the $\beta_j$ from the relations

$$ -\frac{\partial \lambda}{\partial \beta_i} = F_i . $$

It unfortunately is usually impossible to solve this or (14) for the
$\beta_j$ explicitly, in order to obtain a closed-form solution expressed
directly in terms of the known expected values $F_i$ rather than in terms of
the Lagrangian multipliers. Computational methods for finding approximate
solutions are, however, available [14]-[16].
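
To make (11), (12), and (15) concrete, the following sketch solves a discrete
problem with a single expected-value constraint. It uses plain bisection on
the multiplier rather than the Newton-Raphson algorithms of [14]-[16], which
is a simplification that only works because there is one constraint. The die
example and all function names are assumptions made for illustration.

```python
import math


def solve_beta(p, f, target, lo=-50.0, hi=50.0, iters=200):
    """Bisection on beta so that the tilted distribution has mean `target`."""
    def mean(beta):
        w = [pi * math.exp(-beta * fi) for pi, fi in zip(p, f)]
        return sum(wi * fi for wi, fi in zip(w, f)) / sum(w)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) > target:   # the mean decreases as beta increases
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def posterior(p, f, beta):
    """Return q from (11) and lambda from (15) for a single constraint."""
    w = [pi * math.exp(-beta * fi) for pi, fi in zip(p, f)]
    lam = math.log(sum(w))
    q = [wi * math.exp(-lam) for wi in w]
    return q, lam


faces = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # constraint function f(x) = x
prior = [1.0 / 6.0] * 6                  # uniform prior (assumed example)
F = 4.5                                  # known expected value
beta = solve_beta(prior, faces, F)
q, lam = posterior(prior, faces, beta)

h_direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior))
h_from_multipliers = -lam - beta * F     # equation (12)
print(h_direct, h_from_multipliers)
```

The two printed values should agree to within rounding error, which is a
quick consistency check on the sign conventions in (11), (12), and (15).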

When the prior density is uniform on $\mathbf{D}$, minimizing (7) is
equivalent to maximizing the entropy

$$ -\int_{\mathbf{D}} d\mathbf{x}\, q(\mathbf{x}) \log\bigl(q(\mathbf{x})\bigr) . $$

Minimum cross-entropy and maximum entropy are also equivalent when the prior
is exponential in a linear combination of the constraint functions. In both
cases, (11)-(15) all apply with the prior deleted.
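
The equivalence for uniform priors can be checked numerically: for a uniform
prior on N points, the cross-entropy equals log N minus the entropy, so
minimizing one maximizes the other. The values below are an arbitrary
illustration.

```python
import math


def cross_entropy(q, p):
    """Discrete form of (7)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def entropy(q):
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


q = [0.5, 0.25, 0.125, 0.125]
uniform = [0.25] * 4
gap = cross_entropy(q, uniform) - (math.log(4) - entropy(q))
print(abs(gap) < 1e-12)
```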

C.  Justification  of  Cross-Entropy  Minimization 

In what sense does cross-entropy minimization yield the best estimate of
$q^\dagger$? To answer this question, it is useful to ask what would happen
if other functionals besides cross-entropy (7) were used in implementing the
information operator $\circ$ in (8). Recent work has shown that, if the
operator $\circ$ is required to satisfy certain axioms of consistent
inference, and if $\circ$ is implemented by means of functional minimization,
then the principle of minimum cross-entropy follows necessarily [4].
Informally, the axioms state that different ways of taking the information
$I$ into account (for example, in different coordinate systems) should lead
to consistent results. In terms of these axioms, the principle of
cross-entropy minimization is correct in the following sense: Given a prior
probability density and new information in the form of constraints on
expected values, there is only one posterior density satisfying these
constraints that can be chosen by functional minimization in a manner that
satisfies the axioms; this unique posterior can be obtained by minimizing
cross-entropy.

An additional interpretation of the sense in which $q = p \circ I$ is the
best estimate of $q^\dagger$ rests on cross-entropy's well-known [1] and
unique [17] properties as an information measure. Informally speaking,
$H[q,p]$ is a measure of the "information divergence" or "information
dissimilarity" between $q$ and $p$. In these terms, one can interpret the
principle of minimum cross-entropy as follows: Since $q = p \circ I$
minimizes $H[q,p]$, the posterior hypothesis for $q^\dagger$ is as close as
possible in an information-measure sense to the prior hypothesis while at the
same time satisfying the new constraints $I$. Furthermore, in the context of
cross-entropy minimization, cross-entropy satisfies the triangle equality
[12], [13]

$$ H[q^\dagger, p] = H[q^\dagger, p \circ I] + H[p \circ I, p] . \tag{16} $$

Thus, the minimum cross-entropy posterior estimate of $q^\dagger$ is not only
logically consistent, but also closer to $q^\dagger$, in the cross-entropy
sense, than is the prior $p$. Moreover, the difference
$H[q^\dagger, p] - H[q^\dagger, p \circ I]$ is exactly the cross-entropy
$H[p \circ I, p]$ between the posterior and the prior. Hence,
$H[p \circ I, p]$ can be interpreted as the amount of information provided by
$I$ that is not inherent in $p$. Stated differently, $H[p \circ I, p]$ is the
amount of additional distortion introduced if $p$ is used instead of
$p \circ I$. Since, for any density $r$, there exist constraints $I_r$ such
that $r = p \circ I_r$ for any prior $p$, $H[r,p]$ is in general the amount
of information needed to determine $r$ when given $p$, or the amount of
additional distortion introduced if $r$ is used instead of $p$ [13].
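
The triangle equality (16) can be checked numerically on a small discrete
example. In the sketch below the constraint is the expected value of a single
function computed from an assumed "true" distribution, and the posterior is
found by bisection on the one multiplier; the distributions and the helper
name `min_cross_entropy` are illustrative assumptions.

```python
import math


def kl(a, b):
    """Discrete cross-entropy H[a, b]."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def min_cross_entropy(p, f, target, lo=-50.0, hi=50.0, iters=200):
    """Posterior for one constraint sum(q * f) = target, via bisection."""
    def tilt(beta):
        w = [pi * math.exp(-beta * (fi - target)) for pi, fi in zip(p, f)]
        z = sum(w)
        return [wi / z for wi in w]

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        q = tilt(mid)
        if sum(qi * fi for qi, fi in zip(q, f)) > target:
            lo = mid
        else:
            hi = mid
    return tilt(0.5 * (lo + hi))


q_true = [0.1, 0.2, 0.3, 0.4]            # plays the role of the unknown density
prior = [0.4, 0.3, 0.2, 0.1]
f = [1.0, 2.0, 3.0, 4.0]
target = sum(qi * fi for qi, fi in zip(q_true, f))

post = min_cross_entropy(prior, f, target)
lhs = kl(q_true, prior)
rhs = kl(q_true, post) + kl(post, prior)
print(lhs, rhs)
```

The equality holds here because the constraint value was computed from
`q_true` itself, so the true distribution lies in the constraint set.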

Yet another justification for using cross-entropy as a distortion measure
in the context of cross-entropy minimization is provided by the "expectation
matching" property [13], which states that, for an arbitrary density
$q^\dagger$ and a density $q$ of the general form (11), $H[q^\dagger, q]$ is
smallest when the expectations of $q$ match those of $q^\dagger$. In
particular, it follows that $q = p \circ I$ is not only the density that
minimizes $H[q,p]$, as already discussed, but also is the density of the form
(11) that minimizes $H[q^\dagger, q]$. Hence $p \circ I$ is not only closer to
$q^\dagger$ than is $p$, as shown by (16), but it is the closest possible
density of the form (11).

III.  CROSS-ENTROPY  MINIMIZATION  WITH  SUBSET  AND  AGGREGATE  CONSTRAINTS 

In this Section, we introduce the special classes of subset and aggregate
constraints and we summarize properties of $p \circ I$ that apply when $I$
consists either of subset or aggregate constraints. For the case of subset
constraints only, we then show how $p \circ I$ can be computed using a
decomposition method that obviates solving a minimum cross-entropy problem on
the entire state space $\mathbf{D}$. Next we derive several new properties
that apply when $I$ consists of both subset and aggregate constraints and we
show how the decomposition method can be used in this case as well.


A.  Subset  and  Aggregate  Constraints 

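Let $\mathbf{D}_1, \mathbf{D}_2, \ldots, \mathbf{D}_M$ be disjoint subsets
whose union is $\mathbf{D}$ and let
$S_j = (q^\dagger * \mathbf{D}_j \in \mathcal{S}_j)$ be new information about
the conditional density $q^\dagger * \mathbf{D}_j$, where
$\mathcal{S}_j \subseteq \mathcal{D}_j$ and $\mathcal{D}_j$ is the set of
densities on $\mathbf{D}_j$. In particular, suppose that $S_j$ is given by
the constraints

$$ \int_{\mathbf{D}_j} d\mathbf{x}\, [q^\dagger * \mathbf{D}_j](\mathbf{x})\, u_{ij}(\mathbf{x}) = \bar{u}_{ij} , \qquad (i = 1,\ldots,K_j) . $$

These can always be written in the form

$$ \int_{\mathbf{D}_j} d\mathbf{x}\, [q^\dagger * \mathbf{D}_j](\mathbf{x})\, s_{ij}(\mathbf{x}) = 0 , \qquad (i = 1,\ldots,K_j) , \tag{17} $$

where $s_{ij}(\mathbf{x}) = u_{ij}(\mathbf{x}) - \bar{u}_{ij}$. Now the
constraints $S_j$ can also be written as constraints $S'_j$ on the full
density $q^\dagger$, namely $S'_j = (q^\dagger \in \mathcal{S}'_j)$, where
$\mathcal{S}'_j \subseteq \mathcal{D}$. In particular, the constraints $S_j$
in (17) correspond to the constraints $S'_j$ given by

$$ \int_{\mathbf{D}} d\mathbf{x}\, q^\dagger(\mathbf{x})\, f_{ij}(\mathbf{x}) = 0 , \qquad (i = 1,\ldots,K_j) , \tag{18} $$

where

$$ f_{ij}(\mathbf{x}) = \begin{cases} s_{ij}(\mathbf{x}) , & \mathbf{x} \in \mathbf{D}_j \\ 0 , & \text{otherwise.} \end{cases} \tag{19} $$

We write $S = S_1 \wedge S_2 \wedge \cdots \wedge S_M$ as well as
$S' = S'_1 \wedge S'_2 \wedge \cdots \wedge S'_M$, and we refer to $S$ or
$S'$ as subset constraints.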

When new information consists of subset constraints only, cross-entropy
minimization satisfies a property known as weak subset independence [13],
which states that

$$ (p \circ S') * \mathbf{D}_j = (p * \mathbf{D}_j) \circ S_j \tag{20} $$

holds. Given subset constraints $S$ and an arbitrary prior
$p \in \mathcal{D}$, there are two ways of obtaining posterior conditional
densities for each subset $\mathbf{D}_j$: One way is to obtain a posterior
$r = p \circ S'$ for the whole system and then to compute conditional
posteriors $r * \mathbf{D}_j$. Another way is to obtain a conditional
posterior $(p * \mathbf{D}_j) \circ S_j$ from each conditional prior using
only that part of $S$ which refers to the subset $\mathbf{D}_j$. Eq. (20)
states that the results are the same in both cases. Furthermore, the
cross-entropy $H[p \circ S', p]$ satisfies

$$ H[r,p] = H[\Psi r, \Psi p] + \sum_{j=1}^{M} [\Psi r]_j \, H[r * \mathbf{D}_j, p * \mathbf{D}_j] , \tag{21} $$

where $r = p \circ S'$, and where $\Psi$ is a subset aggregation
transformation such that, for any $q \in \mathcal{D}$, $\Psi q$ is a discrete
distribution with

$$ [\Psi q]_j = \int_{\mathbf{D}_j} d\mathbf{x}\, q(\mathbf{x}) . $$

The transformation $\Psi$ aggregates the states in each subset
$\mathbf{D}_j$. For proofs of (20)-(21), see [13]. In fact, it is easy to
show by direct calculation that (21) holds with $r$ and $p$ replaced by
arbitrary densities in $\mathcal{D}$.
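
The remark that (21) holds for arbitrary densities can be verified by direct
calculation on a discrete example. The sketch below uses two arbitrary
six-state distributions and two subsets of three states each; all values and
helper names are illustrative.

```python
import math


def kl(a, b):
    """Discrete cross-entropy H[a, b]."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def aggregate(q, subsets):
    """Aggregate probability of each subset."""
    return [sum(q[i] for i in d) for d in subsets]


def conditional(q, d):
    """Conditional distribution of q on the subset d."""
    mass = sum(q[i] for i in d)
    return [q[i] / mass for i in d]


subsets = [[0, 1, 2], [3, 4, 5]]
r = [0.05, 0.15, 0.20, 0.10, 0.30, 0.20]
p = [0.20, 0.10, 0.10, 0.25, 0.15, 0.20]

lhs = kl(r, p)
agg_r = aggregate(r, subsets)
rhs = kl(agg_r, aggregate(p, subsets)) + sum(
    w * kl(conditional(r, d), conditional(p, d))
    for w, d in zip(agg_r, subsets)
)
print(lhs, rhs)
```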

Now consider situations in which there is information about the aggregate
distribution $\Psi q^\dagger$. In particular, let $A$ be the constraints

$$ \sum_{j=1}^{M} a_{ij} \, [\Psi q^\dagger]_j = 0 , \qquad (i = 1,\ldots,K_a) . \tag{22} $$

(Constraints with non-zero right hand sides can always be written in this
form, as was the case with (17).) The aggregate information $A$ can also be
expressed as information $A'$ about the density $q^\dagger$. In particular,
the constraints $A$ in (22) correspond to the constraints $A'$ given by

$$ \int_{\mathbf{D}} d\mathbf{x}\, q^\dagger(\mathbf{x})\, a_i(\mathbf{x}) = 0 , \qquad (i = 1,\ldots,K_a) , \tag{23} $$

where

$$ a_i(\mathbf{x} \in \mathbf{D}_j) = a_{ij} . \tag{24} $$

We refer to $A'$ or $A$ as aggregate constraints. Given such aggregate
constraints, cross-entropy minimization satisfies a property known as subset
aggregation [13], which states that

$$ \Psi(p \circ A') = (\Psi p) \circ A \tag{25} $$

$$ H[p \circ A', p] = H[\Psi(p \circ A'), \Psi p] \tag{26} $$

hold. Thus, the aggregate probabilities of the posterior $p \circ A'$ are the
same as those obtained by aggregating the prior and then taking the
constraints $A$ into account. For proofs of (25)-(26), see [13].
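
Subset aggregation can be illustrated with a single aggregate constraint on
three subsets. The sketch below solves the problem once on the full
six-state space and once on the three aggregate probabilities, then compares
the results; the bisection solver, the coefficient values, and the names are
assumptions chosen for illustration, and the coefficients must take both
signs for a solution to exist.

```python
import math


def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def tilt(prior, g, lo=-50.0, hi=50.0, iters=200):
    """Minimum cross-entropy posterior for one constraint sum(q * g) = 0."""
    def weights(beta):
        return [pi * math.exp(-beta * gi) for pi, gi in zip(prior, g)]

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(w * gi for w, gi in zip(weights(mid), g)) > 0:
            lo = mid
        else:
            hi = mid
    w = weights(0.5 * (lo + hi))
    z = sum(w)
    return [wi / z for wi in w]


def aggregate(q, subsets):
    return [sum(q[i] for i in d) for d in subsets]


subsets = [[0, 1], [2, 3], [4, 5]]
p = [0.10, 0.20, 0.05, 0.25, 0.30, 0.10]
a = [-1.0, 0.5, 2.0]                       # one aggregate constraint, as in (22)

a_full = [0.0] * len(p)                    # constant on each subset, as in (24)
for j, d in enumerate(subsets):
    for i in d:
        a_full[i] = a[j]

post_full = tilt(p, a_full)                # full-space posterior
agg_of_post = aggregate(post_full, subsets)
post_agg = tilt(aggregate(p, subsets), a)  # posterior of the aggregated prior
print(agg_of_post, post_agg)
```

The two printed vectors should coincide, as (25) asserts, and the two
cross-entropies in (26) can be compared in the same way.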


When new information $I$ consists entirely of subset and aggregate
constraints, we refer to $I = S \wedge A$ or $I' = S' \wedge A'$ as fully
decomposable constraints.


B. Decomposition Method of Computing $p \circ S'$

From (20) we know that the posterior conditionals of $r = p \circ S'$ can be
obtained by solving the minimum cross-entropy problems
$(p * \mathbf{D}_j) \circ S_j$ on each subset. If $\Psi r = \Psi p$ were
true, then one could construct the full posterior $r = p \circ S'$ without
solving a minimum cross-entropy problem on the full space $\mathbf{D}$.
Unfortunately, $\Psi r = \Psi p$ does not hold in general; for an example,
see the discussion following Property 8 in [13]. Nevertheless, we shall show
in this Section that it is simple to compute the posterior aggregate
distribution $\Psi r$ from the subset results $(p * \mathbf{D}_j) \circ S_j$
and that one can therefore construct $p \circ S'$ without solving a minimum
cross-entropy problem on the full space $\mathbf{D}$.

From (11) and (18) it follows that $r = p \circ S'$ is given by

$$ r(\mathbf{x}) = p(\mathbf{x}) \exp\Bigl( -\theta - \sum_{j=1}^{M} \sum_{i=1}^{K_j} \beta_{ij} f_{ij}(\mathbf{x}) \Bigr) , \tag{27} $$

where $\theta$ and $\beta_{ij}$ are Lagrangian multipliers. Using (9) and
(19) it follows that the conditional density in the subset $\mathbf{D}_j$ is

$$ r * \mathbf{D}_j = r_j^{-1} r(\mathbf{x} \in \mathbf{D}_j) = r_j^{-1} p(\mathbf{x}) \exp\Bigl( -\theta - \sum_{i=1}^{K_j} \beta_{ij} s_{ij}(\mathbf{x}) \Bigr) , \tag{28} $$

where $r_j = [\Psi r]_j$. Now the conditional prior is
$p * \mathbf{D}_j = p_j^{-1} p(\mathbf{x} \in \mathbf{D}_j)$, where
$p_j = [\Psi p]_j$, and it follows from (11) and (17) that

$$ (p * \mathbf{D}_j) \circ S_j = p_j^{-1} p(\mathbf{x}) \exp\Bigl( -\xi_j - \sum_{i=1}^{K_j} \beta'_{ij} s_{ij}(\mathbf{x}) \Bigr) . \tag{29} $$

Owing to weak subset independence (20), it follows from (28)-(29) that
$\beta'_{ij} = \beta_{ij}$ and

$$ r_j = p_j \exp(-\theta + \xi_j) , \tag{30} $$

provided that the constraint functions $s_{ij}$ in (17) are linearly
independent. Now, on the right side of (30), the $p_j$ are known from the
prior and the $\xi_j$ are the normalization Lagrangian multipliers from the
subset conditional densities $(p * \mathbf{D}_j) \circ S_j$. Since
$e^{-\theta}$ is just a normalization factor ($\sum_j r_j = 1$), it follows
that one can compute the posterior aggregate probabilities $r_j$ from (30)
using $p$ and the results of $(p * \mathbf{D}_j) \circ S_j$. From the
posterior aggregates and the posterior conditionals, one can then construct
the full posterior $r = p \circ S'$ since
$r(\mathbf{x} \in \mathbf{D}_j) = r_j \, (r * \mathbf{D}_j)$. This result can
also be seen by noting that the multipliers $\beta_{ij}$ and $\theta$ in (27)
are known from (29)-(30) and the normalization requirement
$\sum_j r_j = 1$.
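
The construction just described can be sketched for a small discrete problem
with one zero-mean constraint per subset. Each subset problem is solved by
bisection on its single multiplier (a simplification, not the algorithm of
[14]), the logarithm of each subset normalizer plays the role of the
normalization multiplier in (29), and the aggregate probabilities follow from
(30). The data and the helper names are illustrative assumptions.

```python
import math


def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def tilt(prior, g, lo=-50.0, hi=50.0, iters=200):
    """Solve one constraint sum(q * g) = 0; return (posterior, log normalizer)."""
    def weights(beta):
        return [pi * math.exp(-beta * gi) for pi, gi in zip(prior, g)]

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(w * gi for w, gi in zip(weights(mid), g)) > 0:
            lo = mid
        else:
            hi = mid
    w = weights(0.5 * (lo + hi))
    z = sum(w)
    return [wi / z for wi in w], math.log(z)


p = [0.10, 0.05, 0.15, 0.20, 0.05, 0.25, 0.10, 0.10]
subsets = [[0, 1, 2, 3], [4, 5, 6, 7]]
s = [[-1.0, 0.5, 1.0, -0.5], [2.0, -1.0, 0.0, 1.0]]   # one constraint per subset

p_agg = [sum(p[i] for i in d) for d in subsets]
conds, xis = [], []
for d, pj, sj in zip(subsets, p_agg, s):
    cond, xi = tilt([p[i] / pj for i in d], sj)   # subset posterior and multiplier
    conds.append(cond)
    xis.append(xi)

unnormalized = [pj * math.exp(xi) for pj, xi in zip(p_agg, xis)]
theta = math.log(sum(unnormalized))               # fixed by sum_j r_j = 1
r_agg = [v * math.exp(-theta) for v in unnormalized]   # equation (30)

r = [0.0] * len(p)
for d, rj, cond in zip(subsets, r_agg, conds):
    for i, c in zip(d, cond):
        r[i] = rj * c                             # aggregate times conditional
print(r_agg, kl(r, p), -theta)
```

As a consistency check, the cross-entropy of the assembled posterior relative
to the prior should equal the negative of the normalization multiplier,
because every constraint has a zero right-hand side.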


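C. Cross-Entropy Minimization Given Fully Decomposable Constraints

In this subsection, we prove the following new properties that hold in the
case of fully decomposable subset and aggregate constraints:

$$ p \circ (S' \wedge A') = (p \circ S') \circ A' \tag{31} $$

$$ (p \circ (S' \wedge A')) * \mathbf{D}_j = (p * \mathbf{D}_j) \circ S_j \tag{32} $$

$$ \Psi(p \circ (S' \wedge A')) = (\Psi(p \circ S')) \circ A \tag{33} $$

For convenience, we define

$$ r = p \circ S' \tag{34a} $$

$$ u = (p \circ S') \circ A' \tag{34b} $$

$$ q = p \circ (S' \wedge A') \tag{34c} $$

In these terms, the following relations also hold:

$$ H[q,p] = H[r,p] + H[\Psi q, \Psi r] \tag{35} $$

$$ H[q,p] = H[\Psi q, \Psi p] + \sum_{j=1}^{M} [\Psi q]_j \, H[q * \mathbf{D}_j, p * \mathbf{D}_j] \tag{36} $$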

Discussion: For arbitrary constraints $I_a$ and $I_b$,
$p \circ (I_a \wedge I_b)$ is not in general the same as
$(p \circ I_a) \circ I_b$. One way to see this is to note that, while
$p \circ I_a$ is guaranteed to satisfy $I_a$, $(p \circ I_a) \circ I_b$ is
guaranteed only to satisfy $I_b$; $I_a$ may remain satisfied but only in
special cases. Fully decomposable constraints are an example of such a case.
For another example, see Property 4 in [13]. Eq. (32) shows that the
conditional densities $q * \mathbf{D}_j$ can be obtained by solving the
subset problems $(p * \mathbf{D}_j) \circ S_j$. In the previous subsection we
showed how one can use the resulting solutions to find the aggregate
distribution $\Psi r = \Psi(p \circ S')$. Since (33) shows that one can find
the aggregate distribution $\Psi q$ by solving a minimum cross-entropy
problem using $\Psi r$ as a prior, it follows that one can construct $q$
without solving a minimum cross-entropy problem on the full space
$\mathbf{D}$.

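Proofs: We begin with explicit expressions for the densities (34). The
density $r$ is given by (27). Hence, the density $u$ is given by

$$ u(\mathbf{x}) = p(\mathbf{x}) \exp\Bigl( -\phi - \sum_{j=1}^{M} \sum_{i=1}^{K_j} \beta_{ij} f_{ij}(\mathbf{x}) - \sum_{i=1}^{K_a} \mu_i a_i(\mathbf{x}) \Bigr) , \tag{37} $$

where $\phi$ is a Lagrangian multiplier corresponding to a normalization
constraint and where $\mu_i$ are multipliers corresponding to the constraints
(23). The density $q$ is given by

$$ q(\mathbf{x}) = p(\mathbf{x}) \exp\Bigl( -\lambda - \sum_{j=1}^{M} \sum_{i=1}^{K_j} \hat{\beta}_{ij} f_{ij}(\mathbf{x}) - \sum_{i=1}^{K_a} \alpha_i a_i(\mathbf{x}) \Bigr) , \tag{38} $$

where $\lambda$, $\hat{\beta}_{ij}$, and $\alpha_i$ are Lagrangian
multipliers corresponding respectively to (4), (19), and (23).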

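Now it follows from (9), (19), (24), and (38) that the conditional density
$q * \mathbf{D}_j$ is

$$ [q * \mathbf{D}_j](\mathbf{x}) = q_j^{-1} p(\mathbf{x}) \exp\Bigl( -\lambda - \sum_{i=1}^{K_j} \hat{\beta}_{ij} s_{ij}(\mathbf{x}) - \sum_{i=1}^{K_a} \alpha_i a_{ij} \Bigr) , \tag{39} $$

where $q_j = [\Psi q]_j$. Similarly, it follows from (28) that

$$ [r * \mathbf{D}_j](\mathbf{x}) = r_j^{-1} p(\mathbf{x}) \exp\Bigl( -\theta - \sum_{i=1}^{K_j} \beta_{ij} s_{ij}(\mathbf{x}) \Bigr) . \tag{40} $$

Now (39) has the form

$$ [q * \mathbf{D}_j](\mathbf{x}) = A_j \, p(\mathbf{x}) \exp\Bigl( -\sum_{i=1}^{K_j} \hat{\beta}_{ij} s_{ij}(\mathbf{x}) \Bigr) , \tag{41} $$

$$ A_j = q_j^{-1} \exp\Bigl( -\lambda - \sum_{i=1}^{K_a} \alpha_i a_{ij} \Bigr) , \tag{42} $$

and (40) has the form

$$ [r * \mathbf{D}_j](\mathbf{x}) = B_j \, p(\mathbf{x}) \exp\Bigl( -\sum_{i=1}^{K_j} \beta_{ij} s_{ij}(\mathbf{x}) \Bigr) , \tag{43} $$

$$ B_j = r_j^{-1} \exp(-\theta) . \tag{44} $$

Since both (41) and (43) have the same form, satisfy (17), and integrate to
unity, it follows that they are equal everywhere on $\mathbf{D}_j$. Thus

$$ (p \circ (S' \wedge A')) * \mathbf{D}_j = (p \circ S') * \mathbf{D}_j \tag{45} $$

holds, as well as $\hat{\beta}_{ij} = \beta_{ij}$ and $A_j = B_j$. Eq. (32)
then follows directly from (20) and (45).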


Since $A_j = B_j$ holds, (42) and (44) yield

$$ q_j = r_j \exp\Bigl( \theta - \lambda - \sum_{i=1}^{K_a} \alpha_i a_{ij} \Bigr) . \tag{46} $$

Now, we have

$$ [(\Psi r) \circ A]_j = r_j \exp\Bigl( -\nu - \sum_{i=1}^{K_a} \varepsilon_i a_{ij} \Bigr) , \tag{47} $$

where $\nu$ and $\varepsilon_i$ are Lagrangian multipliers corresponding
respectively to a normalization constraint and to (22). Eq. (46) also
satisfies these constraints. Since it has the same form as (47), it follows
that the two equations are equal; that is, $\Psi q = (\Psi r) \circ A$ holds,
which is (33).

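Now $(\Psi(p \circ S')) \circ A = \Psi((p \circ S') \circ A')$ holds as a
consequence of subset aggregation (25): just substitute $p \circ S'$ for $p$
in (25). It follows from (33) that

$$ \Psi(p \circ (S' \wedge A')) = \Psi((p \circ S') \circ A') \tag{48} $$

holds. If it is also true that

$$ (p \circ (S' \wedge A')) * \mathbf{D}_j = ((p \circ S') \circ A') * \mathbf{D}_j , \tag{49} $$

or $q * \mathbf{D}_j = u * \mathbf{D}_j$, holds, then (31) follows
immediately. To see that (49) does indeed hold, we use (9), (19), (24), and
(37) to express $u * \mathbf{D}_j$ as

$$ [u * \mathbf{D}_j](\mathbf{x}) = u_j^{-1} p(\mathbf{x}) \exp\Bigl( -\phi - \sum_{i=1}^{K_j} \beta_{ij} s_{ij}(\mathbf{x}) - \sum_{i=1}^{K_a} \mu_i a_{ij} \Bigr) , $$

where $u_j = [\Psi u]_j$. This has the form

$$ [u * \mathbf{D}_j](\mathbf{x}) = C_j \, p(\mathbf{x}) \exp\Bigl( -\sum_{i=1}^{K_j} \beta_{ij} s_{ij}(\mathbf{x}) \Bigr) \tag{50} $$

with

$$ C_j = u_j^{-1} \exp\Bigl( -\phi - \sum_{i=1}^{K_a} \mu_i a_{ij} \Bigr) . $$

Now (43) and (50) differ only in the leading factors $B_j$ or $C_j$. Since
they both integrate to unity, it follows that $B_j = C_j$ and
$r * \mathbf{D}_j = u * \mathbf{D}_j$, or
$(p \circ S') * \mathbf{D}_j = ((p \circ S') \circ A') * \mathbf{D}_j$.
Eq. (49) then follows from (45). This completes the proofs of (31)-(33).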

To prove (35)-(36), we note that, since the right hand sides of (17),
(18), (22), and (23) are zero, it follows from (12) that

$$ \xi_j = -H[(p * \mathbf{D}_j) \circ S_j , \, p * \mathbf{D}_j] = -H[q * \mathbf{D}_j , \, p * \mathbf{D}_j] , \tag{51} $$

$$ \theta = -H[r,p] , \tag{52} $$

$$ \lambda = -H[q,p] \tag{53} $$

all hold, where $\xi_j$, $\theta$, and $\lambda$ are Lagrangian multipliers
from (29), (27), and (38). Eq. (46) yields

$$ H[\Psi q, \Psi r] = \sum_{j=1}^{M} q_j \log(q_j / r_j) = \theta - \lambda - \sum_{i=1}^{K_a} \sum_{j=1}^{M} \alpha_i \, q_j \, a_{ij} = \theta - \lambda , $$

where the double sum vanishes because $\Psi q$ satisfies the aggregate
constraints (22). Eq. (35) then follows from (52)-(53). Since (21) holds in
general, (36) follows directly by substitution of $q$ for $r$. Alternatively,
we use (32) to equate the right side of (29) with the right side of (39).
Since $\beta'_{ij} = \hat{\beta}_{ij}$ holds, as pointed out in the
discussion following (29) and (45), it follows that

$$ \log(q_j / p_j) = \xi_j - \lambda - \sum_{i=1}^{K_a} \alpha_i a_{ij} $$

holds. After multiplying this by $q_j$ and summing, we obtain (36) by
substituting (51) and (53).

IV.  COMPUTATIONAL  RESULTS 

In this Section we present some numerical examples to demonstrate the
savings that can be obtained using the decomposition method of computing
$q = p \circ (S' \wedge A')$. We compare the following two methods of
computing $q$:

Method A (decomposition).

a) obtain the posterior conditional densities $q * \mathbf{D}_j$ by computing
$(p * \mathbf{D}_j) \circ S_j$; see (32);

b) compute the aggregate distribution $\Psi(p \circ S')$ using the results of
$(p * \mathbf{D}_j) \circ S_j = (p \circ S') * \mathbf{D}_j$ as explained in
Section III(B);

c) obtain the posterior aggregate distribution $\Psi q$ by computing
$(\Psi(p \circ S')) \circ A$; see (33); and

d) combine $q * \mathbf{D}_j$ and $\Psi q$ to obtain the full posterior $q$.


Method B (full space).

a) obtain $q$ by solving $p \circ (S' \wedge A')$ directly.
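
Steps a) through d) of Method A can be sketched as follows for a toy problem
with one constraint per subset and one aggregate constraint. This is not the
APL function of [14]: each one-constraint problem is solved by bisection,
which is a simplification, and the data, the function `method_a`, and the
helper `tilt` are illustrative assumptions. Every constraint function must
take both signs for a solution to exist.

```python
import math


def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


def tilt(prior, g, lo=-50.0, hi=50.0, iters=200):
    """Solve one constraint sum(q * g) = 0; return (posterior, log normalizer)."""
    def weights(beta):
        return [pi * math.exp(-beta * gi) for pi, gi in zip(prior, g)]

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(w * gi for w, gi in zip(weights(mid), g)) > 0:
            lo = mid
        else:
            hi = mid
    w = weights(0.5 * (lo + hi))
    z = sum(w)
    return [wi / z for wi in w], math.log(z)


def method_a(p, subsets, s, a):
    p_agg = [sum(p[i] for i in d) for d in subsets]
    conds, xis = [], []
    for d, pj, sj in zip(subsets, p_agg, s):          # step a
        cond, xi = tilt([p[i] / pj for i in d], sj)
        conds.append(cond)
        xis.append(xi)
    raw = [pj * math.exp(xi) for pj, xi in zip(p_agg, xis)]
    r_agg = [v / sum(raw) for v in raw]               # step b, from (30)
    q_agg, _ = tilt(r_agg, a)                         # step c, from (33)
    q, r = [0.0] * len(p), [0.0] * len(p)
    for d, qj, rj, cond in zip(subsets, q_agg, r_agg, conds):
        for i, c in zip(d, cond):                     # step d
            q[i] = qj * c
            r[i] = rj * c
    return q, r, q_agg, r_agg


p = [0.05, 0.10, 0.15, 0.20, 0.05, 0.10, 0.10, 0.15, 0.10]
subsets = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
s = [[-1.0, 0.0, 2.0], [1.0, -1.0, 0.5], [-0.5, 1.0, -1.0]]
a = [-1.0, 0.5, 2.0]

q, r, q_agg, r_agg = method_a(p, subsets, s, a)
print(kl(q, p), kl(r, p) + kl(q_agg, r_agg))
```

The two printed values should agree, which is relation (35) evaluated on the
toy problem. No problem larger than a single subset or the aggregate
distribution is ever solved.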

In order to compare the two methods, we computed examples for several
values of N, M, K, and $K_a$, where

N = total size of discrete state space $\mathbf{D}$,

M = number of subsets, each with size N/M,

K = number of constraints per subset, and

$K_a$ = number of aggregate constraints.

In these terms, the decomposition method requires solving M minimum
cross-entropy problems of dimension N/M each with K constraints followed by
one problem of dimension M with $K_a$ constraints. The full space method
requires solving a single minimum cross-entropy problem of dimension N with
$MK + K_a$ constraints. Since the computational complexity of minimum
cross-entropy problems grows rapidly with the dimension of the problem, the
decomposition method can lead to considerable savings.

Specifically, we used the following procedure, where U[0,1] refers to a
pseudo-random number uniformly distributed on [0,1]:

1) Construct a random "unknown" distribution $q^\dagger$ by picking N U[0,1]
values and normalizing.

2) For each of the subsets $\mathbf{D}_j$, $j = 1,\ldots,M$, construct K
random subset constraints $S_j$ by picking N/M U[0,1] values as coefficients
for each constraint and computing the expectations of
$q^\dagger * \mathbf{D}_j$.

3) Construct $K_a$ random aggregate constraints $A$ by picking M U[0,1]
values as coefficients for each constraint and computing the expectations of
$\Psi q^\dagger$.

4) Construct a random prior $p$ by picking N U[0,1] values and normalizing.

5) Compute the posterior $q = p \circ (S' \wedge A')$ by both methods and
compare execution times. For the computations, we used a slightly modified
version of the APL function MINCROSSENT described in [14]. The modification
made the normalization constant $\exp(-\lambda)$ in (11) available after the
function call as a global variable.
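
Steps 1) through 4) of the procedure can be sketched as below. The original
experiments were run in APL; this version uses Python's `random` module, and
the function names, the seed argument, and the dictionary layout are
assumptions made for illustration. Step 5) is not included.

```python
import random


def random_distribution(n, rng):
    """Pick n U[0,1] values and normalize them."""
    values = [rng.random() for _ in range(n)]
    total = sum(values)
    return [v / total for v in values]


def make_problem(m, subset_size, k, ka, seed=0):
    rng = random.Random(seed)
    n = m * subset_size
    q_true = random_distribution(n, rng)                       # step 1
    subsets = [list(range(j * subset_size, (j + 1) * subset_size))
               for j in range(m)]
    agg = [sum(q_true[i] for i in d) for d in subsets]

    subset_constraints = []                                    # step 2
    for d, rj in zip(subsets, agg):
        rows = []
        for _ in range(k):
            coeffs = [rng.random() for _ in d]
            expected = sum(c * q_true[i] / rj for c, i in zip(coeffs, d))
            rows.append((coeffs, expected))
        subset_constraints.append(rows)

    aggregate_constraints = []                                 # step 3
    for _ in range(ka):
        coeffs = [rng.random() for _ in range(m)]
        expected = sum(c * r for c, r in zip(coeffs, agg))
        aggregate_constraints.append((coeffs, expected))

    prior = random_distribution(n, rng)                        # step 4
    return {
        "subsets": subsets,
        "subset_constraints": subset_constraints,
        "aggregate_constraints": aggregate_constraints,
        "prior": prior,
    }


problem = make_problem(m=2, subset_size=10, k=5, ka=1, seed=1)
print(len(problem["prior"]), len(problem["subset_constraints"]))
```

Because the expected values are computed from the hidden distribution, the
generated constraints are consistent by construction.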


We fixed the subset size at N/M = 10 and the number of subset constraints
per subset at K = 5. We computed results for M = 2, 4, 6, 8 subsets with
$K_a = M/2$ aggregate constraints in each case. For each set of values N, M,
K, and $K_a$, we repeated the foregoing procedure four times and averaged
the resulting execution times, which are expressed as seconds of execution
time on an IBM 370/158 processor. The results are summarized in Table I. For
the case of 80 states and 44 total constraints, the decomposition method was
more than ten times faster.
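
The problem sizes in these experiments follow directly from the counts given
earlier in this Section. The short sketch below tabulates them for the stated
settings (subset size 10, five constraints per subset, and M/2 aggregate
constraints); the function name and the dictionary keys are illustrative.

```python
def problem_sizes(m, subset_size, k, ka):
    """Return (dimension, constraints) pairs for both methods."""
    n = m * subset_size
    decomposed = [(subset_size, k)] * m + [(m, ka)]   # M subset problems + 1 aggregate
    full = (n, m * k + ka)                            # one problem on the full space
    return {"decomposed": decomposed, "full": full}


for m in (2, 4, 6, 8):
    sizes = problem_sizes(m, subset_size=10, k=5, ka=m // 2)
    print(m, sizes["full"], len(sizes["decomposed"]))
```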


Table I. Comparison of Decomposition and Full-Space Methods

No. of     Total No.    Total No. of    Method A       Method B    Ratio of
Subsets    of States    Constraints     (Decomp.)      (Full)      B to A
                                        (secs.)        (secs.)

   2          20            11            .45            .57         1.3
   4          40            22          [illegible]      2.9         3.4
   6          60            33            1.3            8.9         6.8
   8          80            44            1.9            22         11.6